Abstract

This paper deals with the impact of fault prediction techniques on checkpointing strategies. We
suppose that the fault-prediction system provides prediction windows instead of exact predictions,
which dramatically complicates the analysis of the checkpointing strategies. We propose a new
approach based upon two periodic modes, a regular mode outside prediction windows, and a proactive
mode inside prediction windows, whenever the size of these windows is large enough. We are able
to compute the best period for any size of the prediction windows, thereby deriving the scheduling
strategy that minimizes platform waste. In addition, the results of this analytical evaluation are
corroborated by a comprehensive set of simulations, which demonstrate the validity of the
model and the accuracy of the approach.


1 Introduction

In this paper, we assess the impact of fault prediction techniques on checkpointing strategies. We consider
jobs executing on a platform subject to faults, and we let $\mu$ be the mean time between faults
(MTBF) of the platform. In the absence of fault prediction, the standard approach is to take periodic
checkpoints, each of length $C$, every period of duration $T$. In steady-state utilization of the platform,
the value $T_{opt}$ of $T$ that minimizes the expected waste of resource usage due to checkpointing is easily
approximated as $T_{opt} = \sqrt{2\mu C} + C$, or $T_{opt} = \sqrt{2(\mu + R)C} + C$ (where $R$ is the duration of the recovery).
The former expression is the well-known Young's formula [IB], while the latter is due to Daly.
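
As a quick numerical illustration, the two first-order formulas can be evaluated directly. The sketch below assumes the readings $T_{opt} = \sqrt{2\mu C} + C$ (Young) and $T_{opt} = \sqrt{2(\mu + R)C} + C$ (Daly); the function names and the sample values are illustrative and are not taken from the paper.

```python
import math


def young_period(mu: float, C: float) -> float:
    """Young's first-order period, sqrt(2*mu*C) + C (all times in seconds)."""
    return math.sqrt(2.0 * mu * C) + C


def daly_period(mu: float, C: float, R: float) -> float:
    """Daly's variant, which also accounts for the recovery time R."""
    return math.sqrt(2.0 * (mu + R) * C) + C


if __name__ == "__main__":
    # Hypothetical platform: MTBF of one day, 10-minute checkpoint and recovery.
    mu, C, R = 86400.0, 600.0, 600.0
    print(f"Young: {young_period(mu, C):.0f} s, Daly: {daly_period(mu, C, R):.0f} s")
```

Both functions return a period that includes the checkpoint itself, matching the convention that useful work is done during $T - C$.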

Assume now that some fault prediction system is available. Such a system is characterized by two
critical parameters, its recall $r$, which is the fraction of faults that are indeed predicted, and its precision
$p$, which is the fraction of predictions that are correct (i.e., correspond to actual faults). In the simple
case where predictions are exact-date predictions, several recent papers [10], [1] have independently shown
that the optimal checkpointing period becomes $T_{opt} = \sqrt{\frac{2\mu C}{1-r}}$. This latter expression is valid only when
$\mu$ is large enough and can be seen as an extension of Young's formula where $\mu$ is replaced by $\frac{\mu}{1-r}$: faults
are replaced by non-predicted faults, and the overhead due to false predictions is negligible. A more
accurate expression for the optimal checkpointing period is available in [1].
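
To make the effect of the recall concrete, the following sketch evaluates the exact-date period, read here as $\sqrt{2\mu C/(1-r)}$ (Young's formula with $\mu$ replaced by $\mu/(1-r)$). The function name is hypothetical, and the formula is only the first-order approximation quoted above, not the more accurate expression of [1].

```python
import math


def exact_date_period(mu: float, C: float, r: float) -> float:
    """First-order period with exact-date predictions: sqrt(2*mu*C / (1 - r)).

    r is the recall; r = 0 gives back sqrt(2*mu*C), the no-predictor value.
    """
    if not 0.0 <= r < 1.0:
        raise ValueError("recall r must satisfy 0 <= r < 1")
    return math.sqrt(2.0 * mu * C / (1.0 - r))


if __name__ == "__main__":
    # Illustrative values only: the period grows as more faults are predicted.
    for r in (0.0, 0.5, 0.75):
        print(r, round(exact_date_period(86400.0, 600.0, r)))
```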

This paper deals with the realistic case (see [111 ES] and Section [s]) where the predictor system does 
not provide exact dates for predicted events, but instead provides prediction windows. A prediction 
window is a time interval of length $I$ during which the predicted event is likely to happen. Intuitively,
one is more at risk during such an interval than in the absence of any prediction, hence the need to 
checkpoint more frequently. But with which period? And what is the size of the prediction window 
above which it proves worthwhile to use a different (smaller) checkpointing period? 

The main objective of this paper is to provide a quantitative answer to these questions. Our key 
contributions are the following: (i) The design of several checkpointing policies that account for the 
different sizes of prediction windows; (ii) The analytical characterization of the best policy for each 
set of parameters; and (iii) The validation of the theoretical results via extensive simulations, for both
Exponential and Weibull failure distributions. It turns out that the analysis of the waste is dramatically
more complicated than when using exact-date predictions [10], [1].

The rest of the paper is organized as follows. First we detail the framework in Section 2. In Section 3,
we describe the new checkpointing policies with prediction windows, and show how to compute the
optimal checkpointing periods that minimize the platform waste. Section 4 is devoted to simulations.
Section 5 provides a brief overview of related work. Finally, we present concluding remarks in Section 6.



2 Framework 



2.1 Checkpointing strategy 

We consider a platform subject to faults. Our work is agnostic of the granularity of the platform,
which may consist either of a single processor, or of several processors that work concurrently and use
coordinated checkpointing. Checkpoints are taken at regular intervals, or periods, of length $T$. We denote
by $C$ the duration of a checkpoint; by construction, we must enforce that $C < T$. Useful work is done
only during $T - C$ units of time for every period of length $T$, if no fault occurs. Hence the waste due
to checkpointing in a fault-free execution is Waste $= \frac{C}{T}$. In the following, the waste always denotes the
fraction of time that the platform is not doing useful work.

When a fault strikes the platform, the application is lacking some resource for a certain period of time
of length $D$, the downtime. The downtime accounts for software rejuvenation (i.e., rebooting [14], [5]) or
for the replacement of the failed hardware component by a spare one. Then, the application recovers
from the last checkpoint. $R$ denotes the duration of this recovery time.



2.2 Fault predictor 

A fault predictor is a mechanism that is able to predict that some faults will take place, within some
time-interval window. In this paper, we assume that the predictor is able to generate its predictions
early enough so that a proactive checkpoint can indeed be taken before or during the event. A first
proactive checkpoint will typically be taken just before the beginning of the prediction window, and
possibly several other ones will be taken inside the prediction window, if its size $I$ is large enough.

Proactive checkpoints may have a different length $C_p$ than regular checkpoints of length $C$. In fact
there are many scenarios. On the one hand, we may well have $C_p > C$ in scenarios where regular
checkpoints are taken at time-steps where the application memory footprint is minimal [13]; on the contrary,
proactive checkpoints are taken according to predictions that can take place at arbitrary instants. On
the other hand, we may have $C_p < C$ in other scenarios [H], e.g., when the prediction is localized to a
particular resource subset, hence allowing for a smaller volume of checkpointed data. To keep full
generality, we deal with two checkpoint sizes in this paper: $C$ for periodic checkpoints, and $C_p$ for proactive
checkpoints (those taken upon predictions).

The accuracy of the fault predictor is characterized by two quantities, the recall and the precision.
The recall $r$ is the fraction of faults that are predicted while the precision $p$ is the fraction of fault
predictions that are correct. Traditionally, one defines three types of events: (i) True positive events are
faults that the predictor has been able to predict (let $True_P$ be their number); (ii) False positive events
are fault predictions that did not materialize as actual faults (let $False_P$ be their number); and (iii) False
negative events are faults that were not predicted (let $False_N$ be their number). With these definitions,
we have $r = \frac{True_P}{True_P + False_N}$ and $p = \frac{True_P}{True_P + False_P}$.
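
The two definitions translate directly into code. The sketch below is a minimal illustration; the function name and the event counts are made up for the example.

```python
def recall_precision(true_p: int, false_p: int, false_n: int) -> tuple[float, float]:
    """Return (recall, precision) from event counts.

    recall    r = True_P / (True_P + False_N)   (fraction of faults predicted)
    precision p = True_P / (True_P + False_P)   (fraction of predictions correct)
    """
    if true_p + false_n == 0 or true_p + false_p == 0:
        raise ValueError("need at least one fault and at least one prediction")
    return true_p / (true_p + false_n), true_p / (true_p + false_p)


if __name__ == "__main__":
    # Hypothetical log: 80 predicted faults, 20 false alarms, 120 missed faults.
    r, p = recall_precision(80, 20, 120)
    print(f"recall = {r:.2f}, precision = {p:.2f}")
```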

In the literature, the lead time is the interval between the date at which the prediction is made
available, and the predicted date of failure (or, more precisely, the beginning of the prediction window).
However, because we do not consider proactive actions with different durations (they all have length
$C_p$), we point out that the distribution of these lead times is irrelevant to our problem. Indeed, either
we have the time to take a proactive action before the failure strikes or not. Therefore, if a failure strikes
less than $C_p$ seconds after the prediction is made available, the prediction was useless. In other words,
predicted failures that come too early to enable any proactive action should be classified as unpredicted
faults, leading to a smaller value of the predictor recall and to a shortened prediction window. Therefore, in
the following, we consider, without loss of generality, that all predictions are made available $C_p$ seconds
before the beginning of the prediction window.

2.3 Fault rates



The key parameter is $\mu$, the mean time between faults (MTBF) of the platform. If the platform is made
of $N$ components whose individual MTBF is $\mu_{ind}$, then $\mu = \frac{\mu_{ind}}{N}$. This result is true regardless of the fault
distribution law [1]. In addition to $\mu$, the platform MTBF, let $\mu_P$ be the mean time between predicted
events (both true positive and false positive), and let $\mu_{NP}$ be the mean time between unpredicted faults
(false negative). Finally, we define the mean time between events as $\mu_e$ (including all three event types).
The relationships between $\mu$, $\mu_P$, $\mu_{NP}$, and $\mu_e$ are the following:

• Rate of unpredicted faults: $\frac{1}{\mu_{NP}} = \frac{1-r}{\mu}$, since $1 - r$ is the fraction of faults that are unpredicted;

• Rate of predicted faults: $\frac{r}{\mu} = \frac{p}{\mu_P}$, since $r$ is the fraction of faults that are predicted, and $p$ is the
fraction of fault predictions that are correct;

• Rate of events: $\frac{1}{\mu_e} = \frac{1}{\mu_P} + \frac{1}{\mu_{NP}}$, since events are either predictions (true or false), or unpredicted
faults.
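
The three relationships above can be turned into a small helper that derives $\mu_P$, $\mu_{NP}$, and $\mu_e$ from $\mu$, $r$, and $p$. This is a sketch under the stated definitions; the function name and the treatment of the limit cases $r = 0$ and $r = 1$ (an infinite mean time) are choices made for the example.

```python
import math


def event_mean_times(mu: float, r: float, p: float) -> tuple[float, float, float]:
    """Return (mu_P, mu_NP, mu_e) from the platform MTBF mu, recall r, precision p.

    1/mu_NP = (1 - r)/mu,   r/mu = p/mu_P,   1/mu_e = 1/mu_P + 1/mu_NP.
    """
    if mu <= 0 or not 0.0 <= r <= 1.0 or not 0.0 < p <= 1.0:
        raise ValueError("need mu > 0, 0 <= r <= 1 and 0 < p <= 1")
    mu_np = mu / (1.0 - r) if r < 1.0 else math.inf  # no unpredicted fault if r = 1
    mu_p = p * mu / r if r > 0.0 else math.inf       # no prediction at all if r = 0
    mu_e = 1.0 / (1.0 / mu_p + 1.0 / mu_np)          # 1/inf evaluates to 0.0
    return mu_p, mu_np, mu_e


if __name__ == "__main__":
    print(event_mean_times(100.0, 0.5, 0.5))
```

With $r = 0$ the helper returns $\mu_{NP} = \mu_e = \mu$, i.e., every event is an unpredicted fault.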

3 Checkpointing strategies 

In this section, we introduce the new checkpointing strategies, and we determine the waste that they 
induce. We then proceed to computing the optimal period for each strategy. 

3.1 Description of the different strategies 

We consider the following general scheme: 

1. While no fault prediction is available, checkpoints are taken periodically with period T; 

2. When a fault is predicted, we decide whether to take the prediction into account or not. This
decision is randomly taken: with probability $q$, we trust the predictor and take the prediction into
account, and, with probability $1 - q$, we ignore the prediction;

3. If we decide to trust the predictor, we use various strategies, depending upon the length $I$ of the
prediction window.

Before describing the different strategies in the situation (3), we point out that the rationale for not always
trusting the predictor is to avoid taking useless checkpoints too frequently. Intuitively, the precision p of 
the predictor must be above a given threshold for its usage to be worthwhile. In other words, if we decide 
to checkpoint just before a predicted event, either we will save time by avoiding a costly re-execution 
if the event does correspond to an actual fault, or we will lose time by unduly performing an extra 
checkpoint. We need a larger proportion of the former cases, i.e., a good precision, for the predictor to 
be really useful. 

Now, to describe the strategies used when we trust a prediction (situation (3)), we define two modes
for the scheduling algorithm:

Regular: This is the mode used when no fault prediction is available, or when a prediction is available
but we decide to ignore it (with probability $1 - q$). In regular mode, we use periodic checkpointing with
period $T_R$. Intuitively, $T_R$ corresponds to the checkpointing period $T$ of Section 2.1.

Proactive: This is the mode used when a fault prediction is available and we decide to trust it, a
decision taken with probability $q$. Consider such a trusted prediction made with the prediction window
$[t_0, t_0 + I]$. Several strategies can be envisioned:

(1) Instant, for Instantaneous. The first strategy is to ignore the time-window and to execute the same
algorithm as if the predictor had given an exact date prediction at time $t_0$. The algorithm interrupts the
current period (of scheduled length $T_R$), checkpoints during the interval $[t_0 - C_p, t_0]$, and then returns to
regular mode: at time $t_0$, it resumes the work needed to complete the interrupted period of the regular
mode.

(2) NoCkptI, for No checkpoint during prediction window. The second strategy is intended for a short
prediction window: instead of ignoring it, we acknowledge it, but make the decision not to checkpoint
during it. As in the first strategy, the algorithm interrupts the current period (of scheduled length $T_R$),
and checkpoints during the interval $[t_0 - C_p, t_0]$. But here, we return to regular mode only at time $t_0 + I$,
where we resume the work needed to complete the interrupted period of the regular mode. During the
whole length of the time-window, we execute work without checkpointing, at the risk of losing work if
a fault indeed strikes. But for a small value of $I$, it may not be worthwhile to checkpoint during the
prediction window (if at all possible, since there is no choice if $I < C_p$).

Figure 1: Outline of Algorithm 1 (strategy WithCkptI).

(3) WithCkptI, for With checkpoints during prediction window. The third strategy is intended for a
longer prediction window and assumes that $C_p < I$: the algorithm interrupts the current period (of
scheduled length $T_R$), and checkpoints during the interval $[t_0 - C_p, t_0]$, but now also decides to take
several checkpoints during the prediction window. The period $T_p$ of these checkpoints in proactive mode
will presumably be shorter than $T_R$, to take into account the higher fault probability. In the following,
we analytically compute the optimal number of such periods. But we take at least one period here, hence
one checkpoint, which implies $C_p < I$. We return to regular mode either right after the fault strikes
within the time window $[t_0, t_0 + I]$, or at time $t_0 + I$ if no actual fault happens within this window. Then,
we resume the work needed to complete the interrupted period of the regular mode. The third strategy
is the most complex to describe, and the complete behavior of the corresponding scheduling algorithm
is shown in Algorithm 1.

Note that, for all strategies, we insert some additional work for the particular case where there is not 
enough time to take a checkpoint before entering proactive mode (because a checkpoint for the regular 
mode is currently on-going). We account for this work as idle time in the expression of the waste, to 
ease the analysis. Our expression of the waste is thus an upper bound. 



Algorithm 1: WithCkptI.

if fault happens then
    After downtime, execute recovery;
    Enter regular mode;
if in proactive mode for a time greater than or equal to I then
    Switch to regular mode
if Prediction made with interval [t, t + I] and prediction taken into account then
    Let t_C be the date of the last checkpoint under regular mode to start no later than t - C_p;
    if t_C + C < t - C_p then (enough time for an extra checkpoint)
        Take a checkpoint starting at time t - C_p
    else (no time for the extra checkpoint)
        Work in the time interval [t_C + C, t]
    W_reg ← max(0, t - C_p - (t_C + C));
    Switch to proactive mode at time t;
while in regular mode and no predictions are made and no faults happen do
    Work for a time T_R - W_reg - C and then checkpoint;
    W_reg ← 0;
while in proactive mode and no faults happen do
    Work for a time T_p - C_p and then checkpoint;
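
The only non-trivial bookkeeping in Algorithm 1 is the step executed when a trusted prediction arrives. The sketch below isolates that step as a pure function; the function and field names are hypothetical, and the code only mirrors the test `t_C + C < t - C_p` and the update of `W_reg` as reconstructed above.

```python
from dataclasses import dataclass


@dataclass
class ProactiveEntry:
    take_extra_checkpoint: bool  # proactive checkpoint during [t - C_p, t]?
    w_reg: float                 # regular-mode work already done, to resume later


def on_trusted_prediction(t: float, t_c: float, C: float, C_p: float) -> ProactiveEntry:
    """Decide what happens between the last regular checkpoint and time t.

    t   : start of the prediction window [t, t + I]
    t_c : start date of the last regular checkpoint starting no later than t - C_p
    """
    enough_time = t_c + C < t - C_p          # last checkpoint ends before t - C_p
    w_reg = max(0.0, t - C_p - (t_c + C))    # work done since that checkpoint ended
    return ProactiveEntry(take_extra_checkpoint=enough_time, w_reg=w_reg)


if __name__ == "__main__":
    print(on_trusted_prediction(t=100.0, t_c=50.0, C=10.0, C_p=5.0))
    print(on_trusted_prediction(t=100.0, t_c=90.0, C=10.0, C_p=5.0))
```

When there is no room for the extra checkpoint, `w_reg` is zero and the next regular period simply runs for its full length.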
3.2 Strategy WithCkptI



In this section we evaluate the execution time under heuristic WithCkptI. To do so, we partition the
whole execution into time intervals defined by the presence or absence of events. An interval starts and
ends with either the completion of a checkpoint or of a recovery (after a failure). To ease the analysis,
we make a simplifying hypothesis: we assume that at most one event, failure or prediction, occurs within
any interval of length $T_R + I + C_p$. In particular, this implies that a prediction or an unpredicted fault
always takes place during the regular mode.

We list below the four types of intervals, and evaluate their respective average length, together with
the average work completed during each of them (see Table 1 for a summary):

1. Two consecutive regular checkpoints with no intermediate events. The time elapsed
between the completion of the two checkpoints is exactly $T_R$, and the work done is exactly $T_R - C$.

2. Unpredicted fault. Recall that, because of the simplifying hypothesis, the fault happens in
regular mode. Because instants where the fault strikes and where the last checkpoint was taken are
independent, on average the fault strikes at time $T_R/2$. A downtime of length $D$ and a recovery of
length $R$ occur before the interval completes. There is no work done.

3. False prediction. Recall that it happens in regular mode. There are two cases:

(a) Taken into account. This happens with probability $q$. The interval lasts $T_R + C_p + I$, since
we take a proactive checkpoint and spend the time $I$ in proactive mode. The work done is
$(T_R - C) + I - \frac{I}{T_p} C_p$.

(b) Not taken into account. This happens with probability $1 - q$. The interval lasts $T_R$ and
the work done is $T_R - C$.

Considering both cases with their probabilities, the average time spent is equal to:
$q(T_R + C_p + I) + (1 - q)T_R = T_R + q(C_p + I)$. The average work done is:
$q\left(T_R - C + I - \frac{I}{T_p} C_p\right) + (1 - q)(T_R - C) = T_R - C + q\left(I - \frac{I}{T_p} C_p\right)$.

4. True prediction. Recall that it happens in regular mode. There are two cases:

(a) Taken into account. Let $E_I^{(f)}$ be the average time at which a fault occurs within the
prediction window (the time at which the fault strikes is certainly correlated to the starting
time of the prediction window; $E_I^{(f)}$ may not be equal to $I/2$). Up to time $E_I^{(f)}$, we work and
checkpoint in proactive mode, with period $T_p$. In addition, we take a proactive checkpoint
right before the start of the prediction window. Then we spend the time $E_I^{(f)}$ in proactive
mode, and we have a downtime and a recovery. Hence, such an interval lasts
$T_R + C_p + E_I^{(f)} + D + R$ on average. The total work done during the interval is $T_R - C + x(T_p - C_p)$
where $x$ is the expectation of the number of proactive checkpoints successfully taken during
the prediction window. Here, $x \approx \frac{E_I^{(f)}}{T_p} - 1$.

(b) Not taken into account. On average the fault occurs at time $T_R/2$. The time interval has
duration $T_R/2 + D + R$, and there is no work done.

Overall the time spent is $q\left(T_R + C_p + E_I^{(f)} + D + R\right) + (1 - q)\left(T_R/2 + D + R\right)$, and the work done
is $q\left(T_R - C + \left(\frac{E_I^{(f)}}{T_p} - 1\right)(T_p - C_p)\right) + (1 - q) \cdot 0$.

So far, we have evaluated the length, and the work done, for each of the interval types. We now estimate 
the expectation of the number of intervals of each type. Consider the intervals defined by an event whose 
mean time between occurrences is A. On average, during a time T, there will be T/A such intervals. Due 
to the simplifying hypothesis, intervals of different types never overlap. Table 1 presents the estimation
of the number of intervals of each type.

We want to estimate the total execution time. To estimate the time spent within intervals of a given
type, we multiply the expectation of the number of intervals of that type by the expectation of the
time spent in each of them. Of course, multiplying expectations is correct only if the corresponding
random variables are independent. Nevertheless, we hope that this will lead us to a good approximation
of the expected execution time. We will assess the quality of the approximation through simulations in
Section 4.

Mode | Number of intervals | Time spent | Work done
(1) | $w_1$ | $T_R$ | $T_R - C$
(2) | $w_2 = \frac{\mathrm{TIME}_{Final}}{\mu_{NP}}$ | $\frac{T_R}{2} + D + R$ | $0$
(3) | $w_3 = \frac{(1-p)\,\mathrm{TIME}_{Final}}{\mu_P}$ | $T_R + q(I + C_p)$ | $T_R - C + q\left(I - \frac{I}{T_p} C_p\right)$
(4) | $w_4 = \frac{p\,\mathrm{TIME}_{Final}}{\mu_P}$ | $q\left(T_R + E_I^{(f)} + C_p\right) + (1-q)\frac{T_R}{2} + D + R$ | $q\left(T_R - C + \left(\frac{E_I^{(f)}}{T_p} - 1\right)(T_p - C_p)\right)$

Table 1: Summary of the different types of interval for WithCkptI.

With our assumptions we have:

$$\mathrm{TIME}_{Final} = w_1 \times T_R + w_2\left(\frac{T_R}{2} + D + R\right) + w_3\left(T_R + q(I + C_p)\right) + w_4\left(q\left(T_R + E_I^{(f)} + C_p\right) + (1-q)\frac{T_R}{2} + D + R\right) \qquad (1)$$

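We use the same line of reasoning to compute the overall amount of work done, which must be equal, by
definition, to $\mathrm{TIME}_{base}$, the execution time of the application without any overhead:

$$\mathrm{TIME}_{base} = w_1 (T_R - C) + w_2 \times 0 + w_3\left(T_R - C + q\left(I - \frac{I}{T_p} C_p\right)\right) + w_4\left(q\left(T_R - C + \left(\frac{E_I^{(f)}}{T_p} - 1\right)(T_p - C_p)\right)\right) \qquad (2)$$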

This equation gives the value of $w_1$ as a function of the other parameters. Looking at Equations (1) and
(2), and at the values of $w_2$, $w_3$, and $w_4$, we remark that $\mathrm{TIME}_{Final}$ can be rewritten as a function of $q$,
as follows: $\mathrm{TIME}_{Final} = \alpha\,\mathrm{TIME}_{base} + \beta\,\mathrm{TIME}_{Final} + q\gamma\,\mathrm{TIME}_{Final}$, that is
$\mathrm{TIME}_{Final} = \frac{\alpha}{1 - \beta - q\gamma}\,\mathrm{TIME}_{base}$,
where neither $\alpha$, nor $\beta$, nor $\gamma$ depends on $q$. With a simple differentiation of $\mathrm{TIME}_{Final}$ with respect
to $q$, we obtain that $\mathrm{TIME}_{Final}$ is either increasing or decreasing with $q$, depending on the sign of $\gamma$.
Consequently, in an optimal solution, either $q = 0$ or $q = 1$. This (somewhat unexpected) conclusion means
that the predictor should either always be trusted or never be trusted: no in-between value
of $q$ does a better job. Thus we can now focus on the two functions $\mathrm{TIME}_{Final}$, the one when $q = 0$
($\mathrm{TIME}_{Final}^{\{0\}}$), and the one when $q = 1$ ($\mathrm{TIME}_{Final}^{\{1\}}$).
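
The monotonicity argument can be checked numerically. The sketch below assumes the closed form $\mathrm{TIME}_{Final} = \frac{\alpha}{1 - \beta - q\gamma}\mathrm{TIME}_{base}$ obtained by solving the relation above; the function names and the coefficient values are purely illustrative, not values from the paper.

```python
def time_final(q: float, alpha: float, beta: float, gamma: float, time_base: float) -> float:
    """TIME_Final = alpha * TIME_base / (1 - beta - q * gamma)."""
    denom = 1.0 - beta - q * gamma
    if denom <= 0.0:
        raise ValueError("model only meaningful when 1 - beta - q*gamma > 0")
    return alpha * time_base / denom


def best_trust_probability(gamma: float) -> float:
    """TIME_Final increases with q when gamma > 0 and decreases when gamma < 0,
    so the optimum is at an endpoint (any q is optimal if gamma == 0)."""
    return 0.0 if gamma > 0.0 else 1.0


if __name__ == "__main__":
    for gamma in (0.05, -0.05):  # arbitrary sample coefficients
        values = [time_final(q / 4, 1.1, 0.1, gamma, 1000.0) for q in range(5)]
        print(gamma, [round(v, 1) for v in values], best_trust_probability(gamma))
```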

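From Table 1 and Equations (1) and (2), one can easily see that

$$\mathrm{TIME}_{Final}^{\{0\}} = \frac{T_R}{T_R - C}\,\mathrm{TIME}_{base} + \frac{\mathrm{TIME}_{Final}^{\{0\}}}{\mu}\left(\frac{T_R}{2} + D + R\right),$$

i.e., that

$$\left(1 - \frac{C}{T_R}\right)\left(1 - \frac{T_R/2 + D + R}{\mu}\right)\mathrm{TIME}_{Final}^{\{0\}} = \mathrm{TIME}_{base}. \qquad (3)$$

This is exactly the equation from [1], in the case of exact-date predictions that are never taken into
account (a good sanity check!). When $q = 1$, we have:

$$\mathrm{TIME}_{Final}^{\{1\}} = \frac{T_R}{T_R - C}\left(\mathrm{TIME}_{base} - \frac{\mathrm{TIME}_{Final}^{\{1\}}}{\mu_P}\left((T_R - C) + (1-p)\left(I - \frac{I}{T_p} C_p\right) + p\left(\frac{E_I^{(f)}}{T_p} - 1\right)(T_p - C_p)\right)\right)$$
$$\qquad + \frac{\mathrm{TIME}_{Final}^{\{1\}}}{\mu_{NP}}\left(\frac{T_R}{2} + D + R\right) + \frac{(1-p)\,\mathrm{TIME}_{Final}^{\{1\}}}{\mu_P}\left(T_R + I + C_p\right) + \frac{p\,\mathrm{TIME}_{Final}^{\{1\}}}{\mu_P}\left(T_R + E_I^{(f)} + C_p + D + R\right)$$

After a little rewriting we obtain:

$$\frac{\mathrm{TIME}_{base}}{\mathrm{TIME}_{Final}^{\{1\}}} = \frac{r}{p\mu}\left(1 - \frac{C_p}{T_p}\right)\left((1-p)I + p\left(E_I^{(f)} - T_p\right)\right) + \left(1 - \frac{C}{T_R}\right)\left(1 - \frac{1}{p\mu}\left(p(D + R) + rC_p + (1-r)p\frac{T_R}{2} + r\left((1-p)I + pE_I^{(f)}\right)\right)\right)$$

Finally, the waste is equal by definition to $\frac{\mathrm{TIME}_{Final} - \mathrm{TIME}_{base}}{\mathrm{TIME}_{Final}}$. Therefore, we have:

$$\mathrm{Waste} = 1 - \frac{r}{p\mu}\left(1 - \frac{C_p}{T_p}\right)\left((1-p)I + p\left(E_I^{(f)} - T_p\right)\right) - \left(1 - \frac{C}{T_R}\right)\left(1 - \frac{1}{p\mu}\left(p(D + R) + rC_p + (1-r)p\frac{T_R}{2} + r\left((1-p)I + pE_I^{(f)}\right)\right)\right) \qquad (4)$$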
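
For readers who want to evaluate the waste of WithCkptI with $q = 1$ numerically, the sketch below transcribes Equation (4) as read from the derivation above (the source typesetting is damaged, so the exact grouping of terms is a reconstruction). The function and argument names are hypothetical; `E_f` stands for $E_I^{(f)}$.

```python
def waste_with_ckpt(T_R: float, T_p: float, mu: float, C: float, C_p: float,
                    D: float, R: float, r: float, p: float,
                    I: float, E_f: float) -> float:
    """Waste of WithCkptI when every prediction is trusted (q = 1).

    Assumes p > 0, T_R > C and T_p > 0; all durations share one time unit.
    """
    # Useful work recovered inside prediction windows.
    window_gain = (r / (p * mu)) * (1.0 - C_p / T_p) * ((1.0 - p) * I + p * (E_f - T_p))
    # Useful fraction of regular-mode time, net of faults and predictions.
    lost = p * (D + R) + r * C_p + (1.0 - r) * p * T_R / 2.0 + r * ((1.0 - p) * I + p * E_f)
    regular = (1.0 - C / T_R) * (1.0 - lost / (p * mu))
    return 1.0 - window_gain - regular


if __name__ == "__main__":
    # Illustrative values only.
    print(waste_with_ckpt(T_R=100.0, T_p=20.0, mu=1000.0, C=10.0, C_p=10.0,
                          D=0.0, R=0.0, r=0.5, p=1.0, I=100.0, E_f=50.0))
```

With $r = 0$ the expression collapses to $1 - (1 - C/T_R)(1 - (T_R/2 + D + R)/\mu)$, the waste given by Equation (3).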



Waste minimization 



When $q = 0$, the optimal period can readily be computed from Equation (3) and we derive that the
optimal period is $\sqrt{2(\mu - (D + R))C}$. This defines a periodic policy we call RFO, for Refined First-
Order approximation. We now minimize the waste of the strategy where $q = 1$. In order to compute the
optimal value for $T_p$, we identify the fraction of the waste in Equation (4) that depends on $T_p$. We can
rewrite Equation (4) as:

$$\mathrm{Waste}^{\{1\}} = \alpha + \frac{r}{p\mu}\left(\left((1-p)I + pE_I^{(f)}\right)\frac{C_p}{T_p} + pT_p\right)$$

where $\alpha$ does not depend on $T_p$. The waste is thus minimized when $T_p$ is equal to
$T_p^{extr} = \sqrt{\frac{\left((1-p)I + pE_I^{(f)}\right) C_p}{p}}$.
Note that we always have to enforce that $T_p^{extr}$ is larger than $C_p$ and does not exceed $I$, and we may
have to round its values accordingly in some extreme cases.
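
A small helper can compute the proactive period and apply the two bounds just mentioned. The sketch assumes the reading $T_p^{extr} = \sqrt{((1-p)I + pE_I^{(f)})\,C_p / p}$; clamping to the interval $[C_p, I]$ is one simple way to enforce the constraints and is a choice made for this example, as is the function name.

```python
import math


def proactive_period(I: float, E_f: float, C_p: float, p: float) -> float:
    """T_p minimizing the waste inside the prediction window, clamped to [C_p, I].

    I   : length of the prediction window
    E_f : average time at which the fault strikes inside the window
    """
    if not 0.0 < p <= 1.0 or C_p > I:
        raise ValueError("need 0 < p <= 1 and C_p <= I")
    t_extr = math.sqrt(((1.0 - p) * I + p * E_f) * C_p / p)
    return min(max(t_extr, C_p), I)


if __name__ == "__main__":
    # Illustrative values: perfect precision, fault expected mid-window.
    print(proactive_period(I=100.0, E_f=50.0, C_p=2.0, p=1.0))
```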

In order to compute the optimal value for $T_R$, we identify the fraction of the waste in Equation (4)
that depends on $T_R$. We can rewrite Equation (4) as:

$$\mathrm{Waste}^{\{1\}} = \beta + \frac{C}{T_R}\left(1 - \frac{1}{p\mu}\left(p(D + R) + r\left(C_p + (1-p)I + pE_I^{(f)}\right)\right)\right) + \frac{(1-r)T_R}{2\mu} \qquad (5)$$

where $\beta$ does not depend on $T_R$ because $T_p^{extr}$ does not depend on $T_R$. Therefore, $\mathrm{Waste}^{\{1\}}$ is minimized
when $T_R$ is equal to

$$T_R^{extr} = \sqrt{\frac{2C\left(p\mu - \left(p(D + R) + r\left(C_p + (1-p)I + pE_I^{(f)}\right)\right)\right)}{p(1-r)}} \qquad (6)$$

Recall that we must always enforce that $T_R^{extr}$ is greater than $C$.

Note that $r = 0$ means that no fault is predicted, and in that case
we obtain the same period as without a predictor. Finally, if we assume that, on average, the fault strikes
at the middle of the prediction window, i.e., $E_I^{(f)} = \frac{I}{2}$, we obtain simplified values:

$$T_p^{extr} = \sqrt{\frac{(2-p)\,I\,C_p}{2p}} \qquad \text{and} \qquad T_R^{extr} = \sqrt{\frac{2C\left(p\mu - \left(p(D + R) + r\left(C_p + \left(1 - \frac{p}{2}\right)I\right)\right)\right)}{p(1-r)}}$$
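
Equation (6), as reconstructed above, is straightforward to evaluate. The sketch below uses hypothetical names and raises an error when the expression under the square root is negative, a situation in which the first-order model does not apply; checking that the result exceeds $C$ is left to the caller.

```python
import math


def regular_period_with_window(mu: float, C: float, C_p: float, D: float, R: float,
                               r: float, p: float, I: float, E_f: float) -> float:
    """T_R minimizing the waste for WithCkptI and NoCkptI (q = 1), Equation (6)."""
    if not 0.0 <= r < 1.0 or not 0.0 < p <= 1.0:
        raise ValueError("need 0 <= r < 1 and 0 < p <= 1")
    numerator = 2.0 * C * (p * mu - (p * (D + R) + r * (C_p + (1.0 - p) * I + p * E_f)))
    if numerator < 0.0:
        raise ValueError("MTBF too small for the first-order model")
    return math.sqrt(numerator / (p * (1.0 - r)))


if __name__ == "__main__":
    # Illustrative values; with r = 0 this is sqrt(2 * C * (mu - (D + R))).
    print(regular_period_with_window(mu=210.0, C=1.0, C_p=1.0, D=5.0, R=5.0,
                                     r=0.0, p=0.5, I=10.0, E_f=5.0))
```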



3.3 Strategy NoCkptI 



In this section we evaluate the execution time under heuristic NoCkptI. The analysis is similar to
that of WithCkptI, the only differences being in the presence of true and false predictions:

3. False prediction. There are two cases:

(a) Taken into account. This happens with probability $q$. The interval lasts $T_R + C_p + I$,
since we take a proactive checkpoint and spend the time $I$ in proactive mode (here, working
without checkpointing). The work done is $(T_R - C) + I$.

(b) Not taken into account. This happens with probability $1 - q$. The interval lasts $T_R$ and
the work done is $T_R - C$.

Considering both cases with their probabilities, the average time spent is equal to:
$q(T_R + C_p + I) + (1 - q)T_R = T_R + q(C_p + I)$. The average work done is:
$q(T_R - C + I) + (1 - q)(T_R - C) = T_R - C + qI$.

4. True prediction. There are two cases:

(a) Taken into account. Let $E_I^{(f)}$ be the average time at which a fault occurs within the
prediction window. We take a proactive checkpoint right before the start of the prediction
window. Then we spend the time $E_I^{(f)}$ in proactive mode working without checkpointing, and
we have a downtime and a recovery. Hence, such an interval lasts $T_R + C_p + E_I^{(f)} + D + R$ on
average. The total work done during the interval is $T_R - C$.

(b) Not taken into account. On average the fault occurs at time $T_R/2$. The time interval has
duration $T_R/2 + D + R$, and there is no work done.

Overall the time spent is $q\left(T_R + C_p + E_I^{(f)} + D + R\right) + (1 - q)\left(T_R/2 + D + R\right)$, and the work done
is $q(T_R - C) + (1 - q) \cdot 0$.

So far, we have evaluated the length, and the work done, for each of the interval types. We now estimate
the expectation of the number of intervals of each type as we did for WithCkptI. Table 2 presents the
estimation of the number of intervals of each type.

Mode | Number of intervals | Time spent | Work done
(1) | $w_1$ | $T_R$ | $T_R - C$
(2) | $w_2 = \frac{\mathrm{TIME}_{Final}}{\mu_{NP}}$ | $\frac{T_R}{2} + D + R$ | $0$
(3) | $w_3 = \frac{(1-p)\,\mathrm{TIME}_{Final}}{\mu_P}$ | $T_R + q(I + C_p)$ | $T_R - C + qI$
(4) | $w_4 = \frac{p\,\mathrm{TIME}_{Final}}{\mu_P}$ | $q\left(T_R + E_I^{(f)} + C_p\right) + (1-q)\frac{T_R}{2} + D + R$ | $q(T_R - C)$

Table 2: Summary of the different types of interval for NoCkptI.

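We estimate the total execution time as for WithCkptI. The formula is the exact same function of
$w_1$, $w_2$, $w_3$, and $w_4$ (but the values of these four parameters will change as the average work done during
some of the types of intervals changes):

$$\mathrm{TIME}_{Final} = w_1 \times T_R + w_2\left(\frac{T_R}{2} + D + R\right) + w_3\left(T_R + q(I + C_p)\right) + w_4\left(q\left(T_R + E_I^{(f)} + C_p\right) + (1-q)\frac{T_R}{2} + D + R\right) \qquad (7)$$

We use the same line of reasoning as previously to compute the overall amount of work done:

$$\mathrm{TIME}_{base} = w_1 (T_R - C) + w_2 \times 0 + w_3\left(T_R - C + qI\right) + w_4\left(q(T_R - C)\right) \qquad (8)$$

This equation gives the value of $w_1$ as a function of the other parameters. As for WithCkptI, one
can easily show that in an optimal solution, either $q = 0$ or $q = 1$. Thus we can now focus on the two
functions $\mathrm{TIME}_{Final}$, the one when $q = 0$ ($\mathrm{TIME}_{Final}^{\{0\}}$), and the one when $q = 1$ ($\mathrm{TIME}_{Final}^{\{1\}}$).

From Table 2 and Equations (7) and (8), one can easily see that

$$\mathrm{TIME}_{Final}^{\{0\}} = \frac{T_R}{T_R - C}\,\mathrm{TIME}_{base} + \frac{\mathrm{TIME}_{Final}^{\{0\}}}{\mu}\left(\frac{T_R}{2} + D + R\right),$$

i.e., that

$$\left(1 - \frac{C}{T_R}\right)\left(1 - \frac{T_R/2 + D + R}{\mu}\right)\mathrm{TIME}_{Final}^{\{0\}} = \mathrm{TIME}_{base}. \qquad (9)$$

This is exactly the equation from [1] in the case of exact-date predictions that are never taken into
account, which we had already retrieved with WithCkptI (same sanity check!). When $q = 1$, we have:

$$\mathrm{TIME}_{Final}^{\{1\}} = \frac{T_R}{T_R - C}\left(\mathrm{TIME}_{base} - \frac{\mathrm{TIME}_{Final}^{\{1\}}}{\mu_P}\left((T_R - C) + (1-p)I\right)\right) + \frac{\mathrm{TIME}_{Final}^{\{1\}}}{\mu_{NP}}\left(\frac{T_R}{2} + D + R\right)$$
$$\qquad + \frac{(1-p)\,\mathrm{TIME}_{Final}^{\{1\}}}{\mu_P}\left(T_R + I + C_p\right) + \frac{p\,\mathrm{TIME}_{Final}^{\{1\}}}{\mu_P}\left(T_R + C_p + E_I^{(f)} + D + R\right)$$

After a little rewriting we obtain:

$$\frac{\mathrm{TIME}_{base}}{\mathrm{TIME}_{Final}^{\{1\}}} = \frac{r}{p\mu}(1-p)I + \left(1 - \frac{C}{T_R}\right)\left(1 - \frac{1}{p\mu}\left(p(D + R) + rC_p + (1-r)p\frac{T_R}{2} + r\left((1-p)I + pE_I^{(f)}\right)\right)\right)$$

Finally, the waste is equal by definition to $\frac{\mathrm{TIME}_{Final} - \mathrm{TIME}_{base}}{\mathrm{TIME}_{Final}}$. Therefore, we have:

$$\mathrm{Waste} = 1 - \frac{r}{p\mu}(1-p)I - \left(1 - \frac{C}{T_R}\right)\left(1 - \frac{1}{p\mu}\left(p(D + R) + rC_p + (1-r)p\frac{T_R}{2} + r\left((1-p)I + pE_I^{(f)}\right)\right)\right) \qquad (10)$$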
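
The waste of NoCkptI with $q = 1$ can be evaluated in the same way. The sketch below transcribes Equation (10) as read from the derivation above (a reconstruction, since the source typesetting is damaged); the function and argument names are hypothetical and `E_f` stands for $E_I^{(f)}$.

```python
def waste_no_ckpt(T_R: float, mu: float, C: float, C_p: float, D: float, R: float,
                  r: float, p: float, I: float, E_f: float) -> float:
    """Waste of NoCkptI when every prediction is trusted (q = 1).

    Assumes p > 0 and T_R > C; all durations share one time unit.
    """
    # Work done inside the windows of false predictions.
    window_gain = (r / (p * mu)) * (1.0 - p) * I
    # Useful fraction of regular-mode time, net of faults and predictions.
    lost = p * (D + R) + r * C_p + (1.0 - r) * p * T_R / 2.0 + r * ((1.0 - p) * I + p * E_f)
    regular = (1.0 - C / T_R) * (1.0 - lost / (p * mu))
    return 1.0 - window_gain - regular


if __name__ == "__main__":
    # Illustrative values only.
    print(waste_no_ckpt(T_R=100.0, mu=1000.0, C=10.0, C_p=10.0, D=0.0, R=0.0,
                        r=0.5, p=1.0, I=0.0, E_f=0.0))
```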



Waste minimization 



When $q = 0$, the optimal value for $T_R$ is obviously the same as the one we computed for WithCkptI
in the case $q = 0$. We now minimize the waste of the strategy where $q = 1$. In order to compute the
optimal value for $T_R$, we identify the fraction of the waste in Equation (10) that depends on $T_R$. We can
rewrite Equation (10) as:

$$\mathrm{Waste}^{\{1\}} = \beta + \frac{C}{T_R}\left(1 - \frac{1}{p\mu}\left(p(D + R) + r\left(C_p + (1-p)I + pE_I^{(f)}\right)\right)\right) + \frac{(1-r)T_R}{2\mu}$$

where $\beta$ does not depend on $T_R$. This equation is identical to Equation (5) and therefore the value of
$T_R$ that minimizes the waste is $T_R^{extr}$, the value given by Equation (6).



3.4 Strategy Instant 

In this section we evaluate the execution time under heuristic Instant. The analysis is very similar to
that of NoCkptI. We only focus on the differences between the performance of Instant and
WithCkptI, which arise only in the presence of true and false predictions:

3. False prediction. There are two cases:

(a) Taken into account. This happens with probability $q$. The interval lasts $T_R + C_p$, since we
fall back to regular mode as soon as the proactive checkpoint completes. The work done is
$T_R - C$.

(b) Not taken into account. This happens with probability $1 - q$. The interval lasts $T_R$ and
the work done is $T_R - C$.

Considering both cases with their probabilities, the average time spent is equal to:
$q(T_R + C_p) + (1 - q)T_R = T_R + qC_p$. The average work done is:
$q(T_R - C) + (1 - q)(T_R - C) = T_R - C$.

4. True prediction. There are two cases:

(a) Taken into account. Let $E_I^{(f)}$ be the average time at which a fault occurs within the
prediction window. We take a proactive checkpoint right before the start of the prediction
window. Then we fall back to the regular mode. After a time $E_I^{(f)}$ the fault strikes. Depending
on the size $I$ of the prediction window, and on when the prediction started after the completion
of the last regular checkpoint, three scenarios can happen. Either the fault strikes while the
heuristic is still trying to complete the work of size $T_R - C$, or it strikes while the heuristic is
trying to take the regular checkpoint after that work, or it strikes after that regular checkpoint
was completed. We overestimate the time lost by assuming that we are in one of the two
former cases, because these are the cases that maximize the amount of work destroyed by a
strike. (In some way, this is equivalent to assuming that $I$ is very small with respect to $T_R$.)
The predicted fault and the completion time of the last regular checkpoint are independent
events. Therefore, on average the fault strikes at time $T_R/2$. After the fault strikes, and after the
downtime and the recovery, we complete the period struck by the fault. Then, the interval
lasts $T_R + C_p + E_I^{(f)} + D + R$ on average. The total work done during the interval is $T_R - C$.

(b) Not taken into account. On average the fault occurs at time $T_R/2$. The time interval has
duration $T_R/2 + D + R$, and there is no work done.

Overall the time spent is $q\left(T_R + C_p + E_I^{(f)} + D + R\right) + (1 - q)\left(T_R/2 + D + R\right)$, and the work done
is $q(T_R - C)$.

So far, we have evaluated the length, and the work done, for each of the interval types. We estimate
the expectation of the number of intervals of each type as we did for WithCkptI and for NoCkptI.
Table 3 presents the estimation of the number of intervals of each type.

Mode | Number of intervals | Time spent | Work done
(1) | $w_1$ | $T_R$ | $T_R - C$
(2) | $w_2 = \frac{\mathrm{TIME}_{Final}}{\mu_{NP}}$ | $\frac{T_R}{2} + D + R$ | $0$
(3) | $w_3 = \frac{(1-p)\,\mathrm{TIME}_{Final}}{\mu_P}$ | $T_R + qC_p$ | $T_R - C$
(4) | $w_4 = \frac{p\,\mathrm{TIME}_{Final}}{\mu_P}$ | $q\left(T_R + E_I^{(f)} + C_p\right) + (1-q)\frac{T_R}{2} + D + R$ | $q(T_R - C)$

Table 3: Summary of the different types of interval for Instant.

We estimate the total execution time as for WithCkptI and NoCkptI:



TiMEFinal = Wx T^,, ^ W^i^ + D + + W-^ (Tr + qC^) 

+ [q{T^ + e/) + Cp) + (1 - g)^^ + + i? ) (11) 
We use the same line of reasoning as previously to compute the overall amount of work done: 



TiMEbasc = ?«i(Tr - C) + 102 X + U-3 (Tr - C) + W4 (<Z (Tr - C)) (12) 

This equation gives the value of W\ as a function of the other parameters. As with WithCkptI and 
NoCkptI, one can easily show that in an optimal solution, either g = or g = 1. Thus we can now focus 
on the two functions TiMEpinai, the one when g = (TiMEp^jJ^^j), and the one when g = 1 (TiMEpj^^^j). 
From Table [3] and Equations (11) and (12 1, one can easily see that 

TimeS,;., = ^JigTlMEi^. + IISSSL (^ + D + fl) , i.o., that 



Final 



TiMEbasc (13) 



This is exactly the equation from [1] in the case of exact-date predictions that are never taken into 
account, as we had already remarked with WithCkptI and NoCkptI (yet another good sanity 
check!). When $q = 1$, we have 



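$$
\begin{aligned}
\text{TIME}_{\text{Final}}^{\{1\}} = {} & \left(\frac{\text{TIME}_{\text{base}}}{T_R - C} - \frac{r\,\text{TIME}_{\text{Final}}^{\{1\}}}{p\mu}\right) T_R + \frac{\text{TIME}_{\text{Final}}^{\{1\}}}{\mu_{NP}} \left(\frac{T_R}{2} + D + R\right) \\
& + \frac{(1-p)\,r\,\text{TIME}_{\text{Final}}^{\{1\}}}{p\mu} \left(T_R + C_p\right) + \frac{r\,\text{TIME}_{\text{Final}}^{\{1\}}}{\mu} \left(T_R + C_p + E_I^{(f)} + D + R\right)
\end{aligned}
$$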

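After a little rewriting we obtain: 

$$
\frac{\text{TIME}_{\text{base}}}{\text{TIME}_{\text{Final}}^{\{1\}}} = \left(1 - \frac{C}{T_R}\right)\left(1 - \frac{1}{p\mu}\left(p(D + R) + r C_p + (1 - r)\,p\,\frac{T_R}{2} + p\,r\,E_I^{(f)}\right)\right)
$$

Finally, the waste is equal by definition to $\frac{\text{TIME}_{\text{Final}} - \text{TIME}_{\text{base}}}{\text{TIME}_{\text{Final}}}$. Therefore, we have: 

$$
\text{Waste} = 1 - \left(1 - \frac{C}{T_R}\right)\left(1 - \frac{1}{p\mu}\left(p(D + R) + r C_p + (1 - r)\,p\,\frac{T_R}{2} + p\,r\,E_I^{(f)}\right)\right) \tag{14}
$$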

Waste minimization 



When $q = 0$, the optimal value for $T_R$ is obviously the same as the one we computed for WithCkptI 
and for NoCkptI in the case $q = 0$. We now minimize the waste of the strategy where $q = 1$. In order 
to compute the optimal value for $T_R$, we identify the fraction of the waste in Equation (14) that depends 
on $T_R$. We can rewrite Equation (14) as: 

$$
\text{Waste}^{\{1\}} = \beta + \frac{C}{T_R}\left(1 - \frac{1}{p\mu}\left(p(D + R) + r C_p + p\,r\,E_I^{(f)}\right)\right) + \frac{1 - r}{\mu}\,\frac{T_R}{2}
$$

where $\beta$ does not depend on $T_R$. Therefore, the value of $T_R$ that minimizes the waste is $T_R^{\text{extr}}$, where 

$$
T_R^{\text{extr}} = \sqrt{\frac{2C\left(p\mu - \left(p(D + R) + r C_p + p\,r\,E_I^{(f)}\right)\right)}{p(1 - r)}}
$$

Again, recall that we must always enforce that the period $T_R$ is greater than $C$. Finally, if we assume 
that, on average, the fault strikes at the middle of the prediction window, i.e., $E_I^{(f)} = \frac{I}{2}$, we have: 

$$
T_R^{\text{extr}} = \sqrt{\frac{2C\left(p\mu - \left(p(D + R) + r C_p + p\,r\,\frac{I}{2}\right)\right)}{p(1 - r)}}
$$
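
As a numerical illustration, the waste of Equation (14) and the extremal period can be evaluated directly. The following is a minimal Python sketch under stated assumptions, not the authors' implementation: the function names `waste_instant` and `extremal_period`, and the argument `E_f` (standing for the expected position of the fault in the prediction window), are hypothetical names introduced here.

```python
import math


def waste_instant(T_R, mu, C, C_p, D, R, p, r, E_f):
    """First-order waste of Equation (14), i.e. the case q = 1.

    All durations are in the same unit (e.g. seconds); E_f is the expected
    position of the fault inside the prediction window (hypothetical name).
    """
    overhead = (p * (D + R) + r * C_p
                + (1.0 - r) * p * T_R / 2.0
                + p * r * E_f) / (p * mu)
    return 1.0 - (1.0 - C / T_R) * (1.0 - overhead)


def extremal_period(mu, C, C_p, D, R, p, r, E_f):
    """Period that cancels the derivative of the waste with respect to T_R."""
    numerator = 2.0 * C * (p * mu - (p * (D + R) + r * C_p + p * r * E_f))
    if numerator <= 0.0:
        # The first-order analysis is meaningless when the MTBF is this small.
        raise ValueError("platform MTBF too small for the first-order formula")
    return math.sqrt(numerator / (p * (1.0 - r)))
```

For instance, with `mu = 60000`, `C = C_p = R = 600`, `D = 60`, `p = 0.82`, `r = 0.85` and `E_f = 300` (illustrative values, in seconds), `extremal_period` returns a period of about 2.16e4 s. The caller is still responsible for enforcing that the period exceeds `C`.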



4 Simulation results 

We start by presenting the simulation framework (Section 4.1). Then we report results using the 
characteristics of two fault predictors from the literature (Section 4.2). 

4.1 Simulation framework 

In order to validate the model, we have instantiated it with several scenarios. The experiments use 
parameters that are representative of current and forthcoming large-scale platforms [H [7]. We take 
C = R = 600 seconds, and D = 60 seconds. We consider three scenarios where proactive checkpoints are 
(i) exactly as expensive as periodic checkpoints ($C_p = C$); (ii) ten times cheaper ($C_p = 0.1C$); and (iii) 
two times more expensive ($C_p = 2C$). The individual (processor) MTBF is $\mu_{ind} = 125$ years, and the 
total number of processors $N$ varies from $N = 2^{14} = 16{,}384$ to $N = 2^{19} = 524{,}288$, so that the platform 
MTBF $\mu$ varies from $\mu = 4{,}010$ min (about 2.8 days) down to $\mu = 125$ min (about 2 hours). For instance, 
the Jaguar platform, with $N = 45{,}208$ processors, is reported to have experienced about one fault per 
day [10], which leads to $\mu_{ind} = \frac{45{,}208}{365} \approx 125$ years. The application size is set to $\text{TIME}_{\text{base}} = 10{,}000$ 
years$/N$. 
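
The platform MTBF values quoted above follow from the relation $\mu = \mu_{ind}/N$. A small Python sketch of this arithmetic is given below; the function names and the 365-day year are assumptions made for the illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: 365-day years


def platform_mtbf_minutes(mu_ind_years, n_procs):
    """Platform MTBF in minutes, for n_procs processors of individual MTBF mu_ind_years."""
    return mu_ind_years * MINUTES_PER_YEAR / n_procs


def individual_mtbf_years(n_procs, platform_mtbf_days):
    """Individual MTBF in years, deduced from an observed platform MTBF in days."""
    return n_procs * platform_mtbf_days / 365.0


print(platform_mtbf_minutes(125, 2**14))   # about 4010 minutes
print(platform_mtbf_minutes(125, 2**19))   # about 125 minutes
print(individual_mtbf_years(45208, 1.0))   # about 124 years, rounded to 125 in the text
```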

We use Maple to analytically compute and plot the optimal value of the waste for the three prediction-aware 
policies, Instant, NoCkptI, and WithCkptI, for the prediction-ignoring policy RFO (corresponding 
to the case $q = 0$), and for the reference heuristic Daly (Daly's [6] periodic policy). In order 
to check the accuracy of our model, we have compared the analytical results with results obtained with 
a discrete-event simulator. The simulation engine generates a random trace of faults, parameterized 
either by an Exponential fault distribution or by Weibull distribution laws with shape parameter 0.5 
or 0.7. Note that Exponential faults are widely used for theoretical studies, while Weibull faults are 
representative of the behavior of real-world platforms [TTl [T71 [T^]. In both cases, the distribution is 
scaled so that its expectation corresponds to the platform MTBF $\mu$. Each fault is flagged as 
predicted with probability $r$. The simulation engine also generates a random trace of false predictions, 
whose distribution is identical to that of the first trace (in Figures [8] through [13] we also consider the 
case where false predictions are generated according to a uniform distribution; results are quite similar). 
This second distribution is scaled so that its expectation is equal to $\frac{p\mu}{r(1-p)}$, the inter-arrival time 
of false predictions. Finally, both traces are merged to produce the final trace including all events (true 
predictions, false predictions, and non predicted faults). Each reported value is the average over 100 
randomly generated instances. 
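
The trace construction described above can be sketched as follows. This is a simplified illustration, not the simulator used for the experiments: the names `interarrival_sampler`, `renewal_trace` and `build_trace` are hypothetical, a true prediction is simply recorded at the date of its fault (the placement of the prediction window around the fault is left out), and only the Exponential and Weibull cases are covered.

```python
import math
import random


def interarrival_sampler(mean, shape, rng):
    """Return a sampler of inter-arrival times with the given expectation.

    shape=None gives an Exponential law; otherwise a Weibull law whose scale
    is chosen so that its expectation, scale * Gamma(1 + 1/shape), equals mean.
    """
    if shape is None:
        return lambda: rng.expovariate(1.0 / mean)
    scale = mean / math.gamma(1.0 + 1.0 / shape)
    return lambda: rng.weibullvariate(scale, shape)


def renewal_trace(sample, horizon):
    """Event dates in [0, horizon], with i.i.d. inter-arrival times."""
    dates, t = [], sample()
    while t <= horizon:
        dates.append(t)
        t += sample()
    return dates


def build_trace(mu, p, r, horizon, shape=None, seed=0):
    """Merge faults (predicted with probability r) and false predictions."""
    rng = random.Random(seed)
    faults = renewal_trace(interarrival_sampler(mu, shape, rng), horizon)
    events = [(t, "true_prediction" if rng.random() < r else "unpredicted_fault")
              for t in faults]
    mu_false = p * mu / (r * (1.0 - p))  # mean time between false predictions
    false_dates = renewal_trace(interarrival_sampler(mu_false, shape, rng), horizon)
    events += [(t, "false_prediction") for t in false_dates]
    events.sort()
    return events
```

On a long enough trace, the fraction of faults labeled `true_prediction` should approach $r$, and the fraction of predictions that are true should approach $p$, which gives a quick consistency check of the scaling of the second trace.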

In the simulations, we compare the five checkpointing strategies listed above. To assess the quality of 
each strategy, we compare it with its BestPeriod counterpart, defined as the same strategy but using 
the best possible period Tr. This latter period is computed via a brute-force numerical search for the 
optimal period. Altogether, there are four BestPeriod heuristics, one for each of the three variants 
with prediction, and one for the case where we ignore predictions, which corresponds to both Daly and 
RFO. Altogether we have a rich set of nine heuristics, which enables us to comprehensively assess the 
actual quality of the proposed strategies. Note that for computer algebra plots, obviously we do not 
need BestPeriod heuristics, since each period is already chosen optimally from the equations. 
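
The brute-force search behind the BestPeriod variants can be pictured with the sketch below. It is only an illustration under stated assumptions: in the experiments each candidate period is evaluated by simulation, whereas here the evaluation function is a stand-in, namely the first-order waste without prediction (the case $q = 0$ of Equation (13)); the names `best_period` and `waste_no_prediction` are hypothetical.

```python
import math


def waste_no_prediction(T, mu, C, D, R):
    """First-order waste of periodic checkpointing when predictions are ignored."""
    return 1.0 - (1.0 - C / T) * (1.0 - (T / 2.0 + D + R) / mu)


def best_period(evaluate, t_min, t_max, step):
    """Evaluate every candidate period in [t_min, t_max] and keep the best one."""
    best_T, best_value = None, math.inf
    T = t_min
    while T <= t_max:
        value = evaluate(T)
        if value < best_value:
            best_T, best_value = T, value
        T += step
    return best_T, best_value


mu, C, D, R = 60000.0, 600.0, 60.0, 600.0  # illustrative values, in seconds
T_best, w_best = best_period(lambda T: waste_no_prediction(T, mu, C, D, R),
                             t_min=C + 10.0, t_max=mu, step=10.0)
```

With this stand-in, the value found by the search can be compared with the period $\sqrt{2C(\mu - D - R)}$ that cancels the derivative of the same expression; replacing `evaluate` by the average waste measured over simulated traces would give a search closer in spirit to the one used in the experiments.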

We experiment with two predictors from the literature: one accurate predictor with high recall and 
precision [19], namely with $p = 0.82$ and $r = 0.85$, and another predictor with more limited recall and 
precision [21], namely with $p = 0.4$ and $r = 0.7$. In both cases, we use five different prediction windows, of 
size $I = 300$, 600, 900, 1200, and 3000 seconds. Figures [2] through [7] show the average waste degradation 
of the nine heuristics for both predictors, as a function of the number of processors $N$. We draw the plots 
as a function of the number of processors $N$ rather than of the platform MTBF $\mu = \mu_{ind}/N$, because 
it is more natural to see the waste increase with larger platforms; however, this work is agnostic of the 
granularity of the processors and intrinsically focuses on the impact of the MTBF on the waste. 



4.2 Analysis of the results 

We start with a preliminary remark: when the graphs for Instant and WithCkptI cannot be seen 
in the figures, this is because their performance is identical to that of NoCkptI, and their respective 
graphs are superposed. 

We first compare the analytical results, plotted by the Maple curves, to the simulation results. There 
is a good correspondence between the analytical curves and the simulations, especially those using an 
Exponential distribution of failures. However, the larger the platform (or the smaller the MTBF), the 
less realistic our assumption that no two events happen during an interval of length $T_R + I + C_p$, and 
the analytical models become less accurate for prediction-aware heuristics. Therefore, the analytical 
results are overly pessimistic in the most failure-prone platforms. Also, recall that an exponential law 
is a Weibull law of shape parameter 1. Therefore, the further the distribution of failures is from an 
exponential law, the larger the difference between analytical results and simulated ones. However, in all 
cases, the analytical results are able to predict the general trends. 

A second assessment of the quality of our analysis comes from the BestPeriod variants of our 
heuristics. When predictions are not taken into account, Daly, and to a lesser extent RFO, are not 
close to the optimal period given by BestPeriod (a similar observation was made in 0). This gap 
increases when the distribution is further apart from an Exponential distribution. However, 
prediction-aware heuristics are very close to BestPeriod in almost all configurations. The only exception is with 
heuristic Instant when $C_p = 2C$, the total number of processors $N$ is equal to either 2^^ or 2^^, and 
$I$ is large. However, when $I = 3000$ and $N = 2^{19}$, the platform MTBF is approximately equal to $6C_p$, 
which renders our hypothesis and analysis invalid. The difference in this case between Instant and its 
BestPeriod should therefore not come as a surprise. 

To better understand why close-to-optimal periods are obtained by prediction-aware heuristics (while 
this is not the case without predictions), we plot the waste as a function of the period $T_R$ for RFO and 
the prediction-aware heuristics (Figures 14 through 17). On these figures one can see that, whatever the 
configuration, periodic checkpointing policies (ignoring predictions) have a well-defined global optimum. 
(One should nevertheless remark that the performance is almost constant in the neighborhood of the 
optimal period, which explains why policies using different periods can obtain similar performance 
in practice, as in [3].) For prediction-aware heuristics, however, the behavior is quite different and two 
scenarios are possible. In the first one, once the optimum is reached, the waste increases very slowly to 
reach an asymptotic value that is close to the optimum waste (e.g., when the platform MTBF is large 
and failures follow an exponential distribution). Therefore, any period chosen close to the optimal one, or 
greater than it, will deliver good performance. In the second scenario, the waste decreases until 
the period becomes larger than the application size, and the waste then stays constant. In other words, in 
these configurations, periodic checkpointing is unnecessary: only proactive actions matter! This striking 
result can be explained as follows: a significant fraction of the failures are predicted, and thus taken 
care of, by proactive checkpoints. The impact of unpredicted failures is mitigated by the proactive 
measures taken for false predictions. To further mitigate the impact of unpredicted faults, the period $T_R$ 
should be significantly shorter than the mean time between proactive checkpoints, which would induce 
a lot of waste due to unnecessary checkpoints if the mean time between unpredicted faults is large with 
respect to the mean time between predictions. This greatly restricts the scenarios for which periodic 
checkpointing can lead to a significant decrease of the waste. 

When the prediction window $I$ is shorter than the duration $C_p$ of a proactive checkpoint, there is 
no difference between NoCkptI and WithCkptI. When $I$ is small but greater than $C_p$ (say, when $I$ 
is around $2C_p$), WithCkptI spends most of the prediction window taking a proactive checkpoint and 
NoCkptI is more efficient. When $I$ becomes "large" with respect to $C_p$, WithCkptI can become more 
efficient than NoCkptI, but becomes significantly more efficient only if the proactive checkpoints are 
significantly shorter than regular ones. Instant can hardly be seen in the graphs as its performance is 
most of the time equivalent to that of NoCkptI. 

Figures [18] through [21] show the influence of the size of the prediction window $I$ on the performance 
of the heuristics. As expected, the smaller the prediction window, the more efficient the prediction-aware 
heuristics. Also, the smaller the number of processors (or the larger the platform MTBF), the larger the 
impact of the size of the prediction window. A surprising result is that taking predictions into account 
is not always beneficial! The analytical results predict that prediction-aware heuristics would achieve 
worse performance than periodic policies in our settings, as soon as the platform includes 2^^ processors. 
In simulations, results are not so extreme. For the largest platforms considered, using predictions has 
almost no impact on performance. But when the prediction window is very large, taking predictions into 
account can indeed be detrimental. These observations can be explained as follows. When the platform 
includes $2^{19}$ processors, the platform MTBF is equal to 7500 s. Therefore, any interval of duration 3000 s 
has a 40% chance to include a failure: a prediction window of 3000 s is not very informative, unless the 
precision and recall of the predictor are almost equal to 1 (which is never the case in practice). Since the 
predictor brings almost no knowledge, trusting it may be detrimental. When comparing the performance 
of, say, NoCkptI for the two predictors, one can see that when failures follow a Weibull distribution with 
shape parameter $k = 0.7$, $I = 600$, and N = 2^^, NoCkptI achieves better performance than RFO when 
$r = 0.85$ and $p = 0.82$, but worse when $p = 0.4$ and $r = 0.7$. The latter predictor generates more false 
predictions (each one inducing an unnecessary proactive checkpoint) and misses more actual failures 
(each one destroying some work). The drawbacks of trusting the predictor outweigh the advantages. If 
failures are few and far apart, almost any predictor will be beneficial. When the platform MTBF is small 
with respect to the cost of proactive checkpoints, only almost perfect predictors will be worth using. For 
each set of predictor characteristics, there is a threshold for the platform MTBF under which predictions 
will be useless or detrimental, but above which predictions will be beneficial. 
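
The 40% figure quoted above matches the first-order estimate $I/\mu = 3000/7500$. The short Python sketch below reproduces it and, for comparison only, also evaluates $1 - e^{-I/\mu}$, the probability one would obtain under the additional assumption of Exponential inter-arrival times (an assumption introduced here, not made in the discussion above). The function name is hypothetical.

```python
import math


def window_hit_probability(window, mtbf, exponential=False):
    """Chance that an interval of length `window` contains a failure.

    By default, the first-order estimate window / mtbf is returned; with
    exponential=True, the value 1 - exp(-window / mtbf) is returned instead.
    """
    if exponential:
        return 1.0 - math.exp(-window / mtbf)
    return window / mtbf


first_order = window_hit_probability(3000.0, 7500.0)                 # 0.4
exact_exp = window_hit_probability(3000.0, 7500.0, exponential=True)  # about 0.33
```

Either way, a window of 3000 s on such a platform is likely to contain a failure whether or not the predictor is right, which is why it carries little information.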

In order to compare the impact of the heuristics ignoring predictions to those using them, we report 
job execution times in Table [4]. For the strategies with prediction, we compute the gain (expressed in 
percentage) over Daly, the reference strategy without prediction. We first remark that RFO achieves 
lower makespans than Daly with gains ranging from 1% with $2^{16}$ processors to 18% with $2^{19}$ processors. 
                           I = 300 s                    I = 1200 s                   I = 3000 s
                     2^16 procs   2^19 procs      2^16 procs   2^19 procs      2^16 procs   2^19 procs
Daly                 81.3         31.0            81.3         31.0            81.3         31.0
RFO                  80.2 (1%)    25.5 (18%)      80.2 (1%)    25.5 (18%)      80.2 (1%)    25.5 (18%)

p = 0.82, r = 0.85
  NoCkptI            66.4 (18%)   17.0 (45%)      67.9 (16%)   20.2 (35%)      71.0 (13%)   24.7 (20%)
  WithCkptI          66.4 (18%)   17.0 (45%)      68.3 (16%)   20.6 (33%)      70.6 (13%)   23.1 (25%)
  Instant            66.5 (18%)   17.0 (45%)      68.0 (16%)   20.3 (34%)      70.9 (13%)   24.1 (22%)

p = 0.4, r = 0.7
  NoCkptI            70.2 (14%)   20.6 (33%)      71.8 (12%)   24.2 (22%)      75.0 (8%)    28.7 (7%)
  WithCkptI          70.2 (14%)   20.6 (33%)      73.6 (9%)    25.5 (18%)      75.1 (8%)    26.6 (14%)
  Instant            70.3 (13%)   20.9 (33%)      72.0 (11%)   24.6 (21%)      75.0 (8%)    27.7 (11%)

Table 4: Job execution times (in days) under the different checkpointing policies, when failures follow a 
Weibull distribution of shape parameter 0.7. Gains are reported with respect to Daly. 
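
The gains in parentheses can be recomputed from the execution times as the relative reduction with respect to Daly. A minimal Python sketch follows; the function name `gain_percent` is hypothetical and the sample values are taken from the table above.

```python
def gain_percent(reference_days, strategy_days):
    """Relative reduction (in %) of the execution time with respect to the reference."""
    return 100.0 * (1.0 - strategy_days / reference_days)


# Weibull shape 0.7, window of 300 s: Daly versus RFO and NoCkptI (p = 0.82, r = 0.85).
print(round(gain_percent(81.3, 80.2)))  # RFO, smaller platform
print(round(gain_percent(31.0, 25.5)))  # RFO, larger platform
print(round(gain_percent(81.3, 66.4)))  # NoCkptI, smaller platform
print(round(gain_percent(31.0, 17.0)))  # NoCkptI, larger platform
```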





                           I = 300 s                    I = 1200 s                   I = 3000 s
                     2^16 procs   2^19 procs      2^16 procs   2^19 procs      2^16 procs   2^19 procs
Daly                 125.7        185.0           125.7        185.0           125.7        185.0
RFO                  120.1 (4%)   114.8 (38%)     120.1 (4%)   114.8 (38%)     120.1 (4%)   114.8 (38%)

p = 0.82, r = 0.85
  NoCkptI            77.4 (38%)   44.9 (76%)      81.8 (35%)   60.7 (67%)      90.0 (28%)   71.5 (61%)
  WithCkptI          77.4 (38%)   44.9 (76%)      83.6 (33%)   64.4 (65%)      89.8 (29%)   66.2 (64%)
  Instant            77.4 (38%)   45.2 (76%)      82.0 (35%)   60.8 (67%)      89.7 (29%)   70.6 (62%)

p = 0.4, r = 0.7
  NoCkptI            84.4 (33%)   58.3 (68%)      89.1 (29%)   76.8 (58%)      97.9 (22%)   83.7 (55%)
  WithCkptI          84.4 (33%)   58.3 (68%)      93.8 (25%)   75.4 (59%)      97.8 (22%)   77.7 (58%)
  Instant            84.5 (33%)   59.6 (68%)      89.4 (29%)   76.64 (58%)     97.7 (22%)   81.9 (56%)



Table 5: Job execution times (in days) under the different checkpointing policies, when failures follow a 
Weibull distribution of shape parameter 0.5. Gains are reported with respect to Daly. 



Overall, the gain due to the predictions decreases when the size of the prediction window increases, 
and increases with the platform size. This gain is obviously closely related to the characteristics of the 
predictor. 

When $I = 300$, the three strategies are identical. When $I$ increases, NoCkptI achieves slightly 
better results than Instant. For low values of $I$, WithCkptI is the worst prediction-aware heuristic. 
But when $I$ becomes large and if the predictor is efficient, then WithCkptI becomes the heuristic of 
choice ($I = 3000$, $p = 0.82$, and $r = 0.85$). 

The reductions in application execution times due to the predictor can be very significant. With 
$p = 0.82$, $r = 0.85$, and $I = 3000$, we save 25% of the total time with $N = 2^{19}$, and 13% with $N = 2^{16}$ 
using strategy WithCkptI. With $I = 300$, we save up to 45% with $N = 2^{19}$, and 18% with $N = 2^{16}$ 
using any strategy (though NoCkptI is slightly better than Instant). Then, with $p = 0.4$ and $r = 0.7$, 
we still save 33% of the execution time when $I = 300$ and $N = 2^{19}$, and 14% with $N = 2^{16}$. The gain gets 
smaller with $I = 3000$ and $N = 2^{16}$ but remains non negligible since we can save 8%. When $I = 3000$ 
and $N = 2^{19}$, however, the best solution is to ignore predictions and simply use RFO (we fall back to 
the case $q = 0$). If we now consider a Weibull law with shape parameter 0.5 instead of 0.7, keeping all 
other parameters identical ($I = 3000$, $N = 2^{19}$, $p = 0.4$ and $r = 0.7$), then the heuristic of choice is 
WithCkptI and the gain with respect to Daly is 57.9%. 

5 Related work 

Considerable research has been conducted on fault prediction using different models (system log 
analysis [19], event-driven approach [9, 19, 21], support vector machines [16, 8], nearest neighbors ^6j, ...). 
Paper    Lead Time    Precision    Recall    Prediction Window
[21]     300 s        40 %         70 %      -
[?]      600 s        35 %         60 %      -
[?]      2 h          64.8 %       65.2 %    yes (size unknown)
[19]     min          82.3 %       85.4 %    yes (size unknown)
[?]      32 s         93 %         43 %      -
[?]      NA           70 %         75 %
[?]      NA           20 %         30 %      1 h
[?]      NA           30 %         75 %      4 h
[?]      NA           40 %         90 %      6 h
[?]      NA           50 %         30 %      eh
[?]      NA           60 %         85 %      12 h



Table 6: Comparative study of different parameters returned by some predictors. 

In this section we give a brief overview of the results obtained by predictors. We focus on their results 
rather than on their methods of prediction. 

The authors of [21] introduce the lead time, that is, the time between the prediction and the actual 
fault. This time should be sufficient to take proactive actions. They are also able to give the location 
of the fault. While this has a negative impact on the precision (see the low value of $p$ in Table [6]), they 
state that it has a positive impact on the checkpointing time (from 1500 seconds to 120 seconds). The 
authors of [19] also consider a lead time, and introduce a prediction window when the predicted fault 
should happen. The authors of (TB] study the impact of different prediction techniques with different 
prediction window sizes. They also consider a lead time, but do not state its value. These two latter 
studies motivate this work, even though [19] does not provide the size of their prediction window. 

Unfortunately, much of the work done on prediction does not provide information that could be really 
useful for the design of efficient algorithms. Such information includes the items stated above, namely the lead 
time and the size of the prediction window, but other useful information would be: (i) the 
distribution of the faults in the prediction window; (ii) the precision as a function of the recall (see our 
analysis); and (iii) the precision and recall as functions of the prediction window (what happens with a 
larger prediction window). 

While many studies on fault prediction focus on the design of the predictor, most of them consider 
that the proactive action should simply be a checkpoint or a migration just in time before the fault. 
However, in their paper [15], Li et al. consider the mathematical problem of determining when and how 
to migrate. In order to be able to use migration, they state that, at any time, 2% of the resources are 
available. This allowed them to design a Knapsack-based heuristic. Thanks to their algorithm, they 
were able to save 30% of the execution time compared to a heuristic that does not take the reliability 
into account, with a precision and recall of 70%, and with a maximum load of 0.7. 

In the simpler case where predictions are exact-date predictions, Gainaru et al. [10] have shown that 
the optimal checkpointing period becomes $T_{opt} = \sqrt{\frac{2\mu C}{1-r}}$, but their analysis is valid only if $\mu$ is very 
large with respect to the other parameters. Our previous work [1] has refined the results of [10], focusing on a 
more accurate analysis of fault prediction with exact dates, and providing a detailed study on the impact 
of recall and precision on the waste. As shown in Section [3], the analysis of the waste is dramatically 
more complicated when using prediction windows than when using exact-date predictions. To the best 
of our knowledge, this work is the first to focus on the mathematical aspect of fault prediction with 
prediction windows, and to provide a model and a detailed analysis of the waste due to all three types 
of events (true and false predictions and unpredicted failures). 

6 Conclusion 

In this work, we have studied the impact of prediction windows on checkpointing strategies. We have 
designed several heuristics that decide whether to trust these predictions, and when it is worth taking 
preventive checkpoints. We have been able to derive a comprehensive set of results and conclusions: 
• We have introduced an analytical model to capture the waste incurred by each strategy, and provided 
for each optimization problem a closed-form formula giving its optimal solution. Contrary to the cases 
without prediction, or with exact-date predictions, the computation of the waste requires a sophisticated 
analysis of the various events, including the time spent in regular or proactive modes. 

• The simulations fully validate the model, and the brute-force computation of the optimal period 
guarantees that our prediction-aware strategies are always very close to the optimal. This holds true 
both for Exponential and Weibull failure distributions. 

• The model is quite accurate and its validity goes beyond the conservative assumption that requires 
a single event per time interval; even more surprising, the accuracy of the model for prediction-aware 
strategies is much better than for the case without predictions, where Daly can be far from the optimal 
period in the case of Weibull failure distributions. 

• Both the analytical computations and the simulations make it possible to characterize when prediction is useful, 
and which strategy performs better, given the key parameters of the system: recall $r$, precision $p$, size 
of the prediction window $I$, size of proactive checkpoints $C_p$ versus regular checkpoints $C$, and platform 
MTBF $\mu$. 

Altogether, the analytical model and the comprehensive results provided in this work make it possible to fully 
assess the impact of fault prediction with time-windows on optimal checkpointing strategies. Future 
work will be devoted to refining the assessment of the usefulness of prediction with trace-based failure and 
prediction logs from current large-scale supercomputers. 

Figure 2: Waste for the different heuristics, with $p = 0.82$, $r = 0.85$, $C_p = C$, and with a trace of false 
predictions parametrized by a distribution identical to the distribution of the trace of failures. 

Figure 3: Waste for the different heuristics, with $p = 0.82$, $r = 0.85$, $C_p = 0.1C$, and with a trace of false 
predictions parametrized by a distribution identical to the distribution of the trace of failures. 

Figure 4: Waste for the different heuristics, with $p = 0.82$, $r = 0.85$, $C_p = 2C$, and with a trace of false 
predictions parametrized by a distribution identical to the distribution of the trace of failures. 

Figure 5: Waste for the different heuristics, with $p = 0.4$, $r = 0.7$, $C_p = C$, and with a trace of false 
predictions parametrized by a distribution identical to the distribution of the trace of failures. 

Figure 6: Waste for the different heuristics, with $p = 0.4$, $r = 0.7$, $C_p = 0.1C$, and with a trace of false 
predictions parametrized by a distribution identical to the distribution of the trace of failures. 

Figure 9: Waste for the different heuristics, with $p = 0.82$, $r = 0.85$, $C_p = 0.1C$, and with a trace of false 
predictions parametrized by a uniform distribution. 

Figure 11: Waste for the different heuristics, with $p = 0.4$, $r = 0.7$, $C_p = C$, and with a trace of false 
predictions parametrized by a uniform distribution. 

Figure 13: Waste for the different heuristics, with $p = 0.4$, $r = 0.7$, $C_p = 2C$, and with a trace of false 
predictions parametrized by a uniform distribution. 

Figure 14: Waste as a function of the period $T_R$ for the different heuristics, with $p = 0.82$, $r = 0.85$, 
$C_p = C$, and with a platform of 2^^ processors. 

Figure 15: Waste as a function of the period $T_R$ for the different heuristics, with $p = 0.82$, $r = 0.85$, 
$C_p = C$, and with a platform of 2^^ processors. 

Figure 16: Waste as a function of the period $T_R$ for the different heuristics, with $p = 0.4$, $r = 0.7$, $C_p = C$, 
and with a platform of 2^^ processors. 



[Figure 17 panels, one per failure distribution: Exponential, Weibull $k = 0.7$, Weibull $k = 0.5$; horizontal axis: period $T_R$.] 

Figure 17: Waste as a function of the period $T_R$ for the different heuristics, with $p = 0.4$, $r = 0.7$, $C_p = C$, 
and with a platform of 2^^ processors. 

Figure 18: Waste as a function of the prediction window $I$ for the different heuristics, with $p = 0.82$, 
$r = 0.85$, $C_p = C$, and with a platform of 2^^ processors. 

Figure 19: Waste as a function of the prediction window $I$ for the different heuristics, with $p = 0.82$, 
$r = 0.85$, $C_p = C$, and with a platform of 2^^ processors. 

Figure 20: Waste as a function of the prediction window $I$ for the different heuristics, with $p = 0.4$, $r = 0.7$, 
$C_p = C$, and with a platform of 2^^ processors. 

Figure 21: Waste as a function of the prediction window $I$ for the different heuristics, with $p = 0.4$, $r = 0.7$, 
$C_p = C$, and with a platform of 2^^ processors. 